Introduction

This is an exploratory data analysis of the relationships between diabetes, obesity, leisure time inactivity, and per capita income in US counties. If you are interested in how the analysis was done, see the file code.html.

Datasets and cleaning data

The US county income data is from the Wikipedia page List of United States counties by per capita income. The diabetes dataset is from the CDC Diabetes page County Data Indicators. It contains statistics about diabetes, obesity, and leisure time inactivity in US counties from 2004 to 2013. Here, only the data from 2013 was used. Both datasets were downloaded on 2/3/2017.

The county names in both datasets were cleaned and made compatible so that the datasets could be joined.

Diabetes vs. per capita income

To examine the possible relationship between diabetes prevalence and per capita income in different US counties, we join the two datasets on state and county name.
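The join described above can be sketched as follows. This is only a minimal illustration in Python, which is not necessarily the language of the original analysis; the function name, the field names (state, county, income, diabetes_pct), and the sample rows are all assumptions made up for the example. The state is part of the key because county names repeat across states.

```python
def join_on_county(income_rows, health_rows):
    """Inner-join two lists of dicts on the (state, county) pair."""
    # Index the income rows by their (state, county) key for fast lookup.
    income_by_key = {(r["state"], r["county"]): r for r in income_rows}
    joined = []
    for row in health_rows:
        match = income_by_key.get((row["state"], row["county"]))
        if match is not None:
            # Keep only counties present in both datasets.
            joined.append({**match, **row})
    return joined


# Made-up illustrative rows, NOT values from the actual datasets.
income = [
    {"state": "Texas", "county": "DeWitt", "income": 20000},
    {"state": "Illinois", "county": "DeWitt", "income": 25000},
]
diabetes = [
    {"state": "Texas", "county": "DeWitt", "diabetes_pct": 10.0},
    {"state": "Ohio", "county": "Adams", "diabetes_pct": 12.0},
]
merged = join_on_county(income, diabetes)
```

Counties that appear in only one of the two inputs are dropped, so it is worth comparing the number of joined rows with the sizes of the inputs to spot names that failed to match.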

We plot the data as a scatterplot with one point for each county. The colors of the points are based on the population size. A linear regression line is also shown.

There seems to be a pattern: diabetes prevalence is higher in counties with lower per capita income.
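For readers who want to see what the regression line amounts to, here is a minimal ordinary least squares sketch in plain Python. It is an assumption that the line in the plots was fitted this way; the function name and the sample numbers are made up for illustration and are not taken from the datasets.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of the x-y cross deviations.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up illustrative values, NOT data from the actual datasets.
incomes = [1.0, 2.0, 3.0, 4.0]
prevalence = [9.0, 7.0, 5.0, 3.0]
slope, intercept = fit_line(incomes, prevalence)
```

A negative slope corresponds to the kind of pattern discussed in this section, where the y-variable falls as per capita income rises.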

Obesity vs. per capita income

Similarly, we join the obesity data with the income data and plot the result.

Here, too, there seems to be a pattern: obesity prevalence is higher in counties with lower per capita income.

Leisure time inactivity vs. per capita income

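Next, we join the leisure time inactivity data with the income data and plot the result.

The pattern is similar to the two cases above: leisure time inactivity seems to be more prevalent in counties with lower per capita income.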

Obesity vs. diabetes

It is also interesting to see whether there is a relationship between obesity and diabetes. To do this, we join the obesity and diabetes data and plot the result.

Higher prevalence of obesity seems to be related to higher prevalence of diabetes.

Leisure time inactivity vs. diabetes

Similarly, we examine the relationship between leisure time inactivity and diabetes.

The trend is similar to the one seen above for obesity vs. diabetes.

Leisure time inactivity vs. obesity

Finally, we examine the relationship between leisure time inactivity and obesity.

Here, too, the trend is similar: higher prevalence of leisure time inactivity seems to be related to higher prevalence of obesity.

Problems in doing the analysis

The different forms of county names caused a lot of manual work. There were systematic differences, like the omission of the word ‘County’ from the county names in the Wikipedia data, and non-systematic differences, like ‘De Witt’ vs. ‘DeWitt’.
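Part of this cleaning could be automated by reducing each name to a comparison key before joining. The sketch below is a Python illustration only, not the cleaning actually used in the analysis; the function name and the list of suffixes are assumptions, and the real data may need a different list.

```python
import re
import unicodedata

# Hypothetical designations to strip; adjust to what the data actually contains.
SUFFIXES = ("county", "parish", "borough", "census area", "municipality")


def normalize_county(name: str) -> str:
    """Return a comparison key for a county name."""
    # Drop accents and other non-ASCII marks, then lowercase.
    ascii_name = unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode("ascii")
    key = ascii_name.strip().lower()
    # Systematic difference: remove a trailing designation such as "County".
    for suffix in SUFFIXES:
        if key.endswith(" " + suffix):
            key = key[: -len(suffix) - 1]
            break
    # Non-systematic differences: ignore spaces and punctuation,
    # so that "De Witt" and "DeWitt" map to the same key.
    return re.sub(r"[^a-z0-9]", "", key)
```

With this key, normalize_county("De Witt County") and normalize_county("DeWitt") both give "dewitt". Removing spaces and punctuation can in principle merge two genuinely different names, so the key should be combined with the state, and any names that still fail to match should be checked by hand.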

The plotly package was not always easy to use. One particularly strange behavior appeared when I added the linear regression lines to the plots. For some reason, plotly also drew marker points on the regression line, one for each original point, with the x-values of the points projected onto the line. These markers completely hid the actual line. It made no difference whether I set mode="lines" or mode="lines+markers". Finally I tried mode="lines-markers", and that removed the markers. However, I did not find this value in the documentation.
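A possible way to sidestep the problem, sketched here in Python and not verified against the setup described above, is to give the regression line its own trace that contains only the two end points of the line. Even if markers were drawn on that trace, there would be only two of them. In the plotly figure schema, the mode of a scatter trace is documented as a combination of "lines", "markers", and "text" joined with "+", or "none", which would explain why "lines-markers" is not in the documentation. The function name and trace names below are assumptions for illustration.

```python
def regression_traces(xs, ys, slope, intercept):
    """Build two plotly-style scatter traces as plain dicts:
    one for the county points and one for the fitted line."""
    points = {
        "type": "scatter",
        "mode": "markers",
        "x": list(xs),
        "y": list(ys),
        "name": "counties",
    }
    # The line trace gets only its two end points, so it never carries
    # one marker per county.
    x_ends = [min(xs), max(xs)]
    line = {
        "type": "scatter",
        "mode": "lines",
        "x": x_ends,
        "y": [slope * x + intercept for x in x_ends],
        "name": "linear fit",
    }
    return [points, line]
```

In Python, a list of such dicts can be passed as the data argument of plotly.graph_objects.Figure. Whether the same approach removes the markers in the environment used for this analysis is an open question.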